Evaluating Elevation, Experience & Body Type impact on Football Players
Fall 2024 Data Science Project
Luke Walker, Evan Losin, Owen Davitz
Contributions:
For each member, list which of the following sections they worked on, and summarize the contributions in
1-2 sentences. Be specific!
A: Project idea: Evan had the idea to look into sports statistics; Luke found the actual dataset.
B: Dataset Curation and Preprocessing: Since there were 3 of us in the group and 3 sections needed for
each step, we split it accordingly. Luke did the piece about how experience affects quarterbacks, Evan did
how height linked to yards per reception, and Owen did how elevation affects field goal accuracy.
C: Data Exploration and Summary Statistics: Luke did the piece about how experience affects quarterbacks,
Evan did how height linked to yards per reception, and Owen did how elevation affects field goal
accuracy.
D: ML Algorithm Design/Development: As we each had the most experience with our own data that we had done
previously, we split the machine learning models accordingly.
E: ML Algorithm Training and Test Data Analysis. We continued working on the parts we had from the start.
F: Visualization, Result Analysis, Conclusion: We each did these for the Machine learning part we
took.
G: Final Tutorial Report Creation: We all met to compile the components we worked on into the required
format.
Intro¶
For our project we chose to investigate how different football-related stats affect the performance of players. As we all have an interest in the NFL, we wanted to see if we could uncover lesser-known relationships between different stats. We each decided to investigate a different area: Luke looked into how a quarterback's experience affects their play, Evan looked into the relationship between height and yards per catch, and Owen looked into how a kicker's birth elevation affects their performance at different elevations. Finding the answers to how these stats are related could help teams decide which players to keep on their rosters, and it also provides additional insight for fans.
Data Curation¶
Link to data set: https://www.kaggle.com/datasets/kendallgillies/nflstatistics?select=Basic_Stats.csv. In addition, we had to acquire elevation data, which we did partially from the Open-Elevation API and partially by looking up the remaining elevations on Google.
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from geopy.geocoders import Nominatim
import requests
import seaborn as sns
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
#Import Career Passing stats
Career_Passing = pd.read_csv('Career_Stats_Passing.csv')
basic_stats = pd.read_csv('Basic_Stats.csv')
career_rec = pd.read_csv('Career_Stats_Receiving.csv')
rec_game_logs = pd.read_csv('Game_Logs_Wide_Receiver_and_Tight_End.csv')
#basic_stats
#Convert the stats to numeric types so they can be properly processed
#(passing yards are approximated here as 10 yards per completion)
Career_Passing['Passing Yards'] = pd.to_numeric(Career_Passing['Passes Completed'], errors='coerce') * 10
Career_Passing['Games Played'] = pd.to_numeric(Career_Passing['Games Played'], errors='coerce')
Career_Passing['Yards per Game'] = Career_Passing['Passing Yards'] / Career_Passing['Games Played']
#load csv files
basic_stats_df = pd.read_csv('Basic_Stats.csv')
kicker_career_df = pd.read_csv('Career_Stats_Field_Goal_Kickers.csv')
kicker_game_df = pd.read_csv('Game_Logs_Kickers.csv')
#set player ids to strings
basic_stats_df['Player Id'] = basic_stats_df['Player Id'].astype(str)
kicker_game_df['Player Id'] = kicker_game_df['Player Id'].astype(str)
kicker_career_df['Player Id'] = kicker_career_df['Player Id'].astype(str)
#rename career field goal percentage
kicker_career_df['Career FG Percentage'] = kicker_career_df['FG Percentage']
kicker_career_df['Career FG Percentage'] = pd.to_numeric(kicker_career_df['Career FG Percentage'], errors='coerce')
kicker_career_df.dropna(inplace=True)
#remove kickers that didn't play a whole season (more than 16 career games required)
kicker_career_df['Total Games'] = kicker_career_df.groupby('Player Id')['Games Played'].transform('sum')
kicker_career_df = kicker_career_df[kicker_career_df['Total Games'] > 16]
#get birthplace of all kickers
joined_df = pd.merge(basic_stats_df, kicker_career_df, on='Player Id')
joined_df = joined_df.drop(['Position_x', 'Position_y'], axis=1)
kickers_df = joined_df[['Player Id', 'Birth Place']]
kickers_df = kickers_df.drop_duplicates()
#code used to get elevations for player birthplaces
elevation_df = pd.DataFrame()
elevation_df['Location'] = kickers_df['Birth Place'].unique()
elevation_df['Elevation'] = None
def get_coordinates(city):
    geolocator = Nominatim(user_agent="city_elevation", timeout=20)
    location = geolocator.geocode(city)
    if location:
        return (location.latitude, location.longitude)
    else:
        location = geolocator.geocode(city.split()[-1])
        if location:
            return (location.latitude, location.longitude)
        else:
            print(f"Could not find coordinates for {city}")
            return None

def get_elevation(lat, lon):
    url = f'https://api.open-elevation.com/api/v1/lookup?locations={lat},{lon}'
    response = requests.get(url)
    if response.status_code == 200:
        elevation_data = response.json()
        return elevation_data['results'][0]['elevation']
    else:
        print("Error retrieving elevation data")
        return None

def get_elevations(cities):
    for city in cities:
        coords = get_coordinates(city)
        if coords:
            elevation = get_elevation(coords[0], coords[1])
            if elevation is not None:
                elevation_df.loc[elevation_df['Location'] == city, 'Elevation'] = elevation
#code to generate csv
#get_elevations(elevation_df['Location'])
#elevation_df.to_csv('birth_elevations.csv')
#there were a couple that didn't work so I filled them in manually
#import birth elevation
birth_elevation_df = pd.read_csv('birth_elevations.csv')
birth_elevation_df.sort_values('Elevation')
#code to get elevations for all team locations
#I just did these manually
team_elevation_df = pd.read_csv('team_elevations.csv')
team_elevation_df
#map team names to abbreviations
team_mapping = {
    "New Orleans Saints": "NO",
    "Oakland Raiders": "OAK",
    "New York Giants": "NYG",
    "Washington Redskins": "WAS",
    "Carolina Panthers": "CAR",
    "Buffalo Bills": "BUF",
    "New York Jets": "NYJ",
    "Pittsburgh Steelers": "PIT",
    "Baltimore Ravens": "BAL",
    "Detroit Lions": "DET",
    "Miami Dolphins": "MIA",
    "Dallas Cowboys": "DAL",
    "San Francisco 49ers": "SF",
    "Houston Texans": "HOU",
    "Cleveland Browns": "CLE",
    "St. Louis Rams": "STL",
    "San Diego Chargers": "SD",
    "Minnesota Vikings": "MIN",
    "Cincinnati Bengals": "CIN",
    "Arizona Cardinals": "ARI",
    "Green Bay Packers": "GB",
    "Tennessee Titans": "TEN",
    "Seattle Seahawks": "SEA",
    "Atlanta Falcons": "ATL",
    "Kansas City Chiefs": "KC",
    "Jacksonville Jaguars": "JAX",
    "Tampa Bay Buccaneers": "TB",
    "Denver Broncos": "DEN",
    "Indianapolis Colts": "IND",
    "Chicago Bears": "CHI",
    "New England Patriots": "NE",
    "Philadelphia Eagles": "PHI",
    "Los Angeles Raiders": "RAI",
    "Los Angeles Rams": "RAM",
    "Phoenix Cardinals": "PHO",
    "Boston Patriots": "BOS"
}
Exploratory Data Analysis (LUKE)¶
The first hypothesis is that a quarterback who played more games would have a higher average of passing yards per game compared to one with fewer games played.
#The data we need was already converted into a processable format above
#Since there is so much data, it needs to be cleaned so that each data point makes sense in this context.
#This drops rows with missing or invalid values in the two columns below
valid_data = Career_Passing.dropna(subset=['Yards per Game', 'Games Played'])
#Find the median games played so we know where to split it
median_games = valid_data['Games Played'].median()
#Split the data into two sections, those who played more and less than the median
more_games_group = valid_data[valid_data['Games Played'] > median_games]
fewer_games_group = valid_data[valid_data['Games Played'] <= median_games]
#Calculate the average yards per game for each group
more_games_avg_ypg = more_games_group['Yards per Game'].mean()
fewer_games_avg_ypg = fewer_games_group['Yards per Game'].mean()
#Print findings
print(f'Avg passing yards per game of a quarterback with more games: {more_games_avg_ypg}')
print(f'Avg passing yards per game of a quarterback with fewer games: {fewer_games_avg_ypg}')
#Make Box Plot to represent findings
data = [fewer_games_group['Yards per Game'], more_games_group['Yards per Game']]
labels = ['Fewer Games Played', 'More Games Played']
plt.boxplot(data, labels=labels)
plt.title('Comparison of Passing Yards per Game: Fewer vs More Games Played')
plt.ylabel('Passing Yards per Game')
plt.show()
Avg passing yards per game of a quarterback with more games: 71.88551732693666
Avg passing yards per game of a quarterback with fewer games: 54.688227919767066
Three Conclusions Drawn From This Data:
- Descriptive statistics
- Correlation analysis
- Hypothesis testing
Descriptive Statistics
The data description shows the overview of the data set, including things such as the total count of how
many data points were used, the mean, standard deviation, min, max, and quartile stats for the games
played, pass attempts and passer rating. It helps us understand the overall distribution of these key
statistics.
dataset_summary = Career_Passing.describe()
print(dataset_summary)
plt.hist(valid_data['Games Played'], bins=15)
plt.title('Distribution of Games Played')
plt.xlabel('Games Played')
plt.ylabel('Frequency')
plt.show()
Year Games Played Pass Attempts Per Game Passing Yards \
count 8525.000000 8525.000000 8525.000000 4347.000000
mean 1982.052551 10.294311 5.787824 687.941109
std 23.822176 5.305723 10.533562 1021.944299
min 1924.000000 0.000000 0.000000 0.000000
25% 1965.000000 6.000000 0.000000 10.000000
50% 1985.000000 12.000000 0.000000 110.000000
75% 2003.000000 15.000000 5.900000 1080.000000
max 2016.000000 17.000000 51.000000 4710.000000
Passer Rating Yards per Game
count 8525.000000 4337.000000
mean 32.226111 62.971635
std 40.485956 73.739956
min 0.000000 0.000000
25% 0.000000 0.714286
50% 0.000000 25.000000
75% 64.900000 120.714286
max 158.300000 294.375000
Correlation Analysis
There is a weak positive correlation (0.13) between the number of games played and passing yards per game. While not a strong relationship, quarterbacks who play more games tend to have slightly better averages.
correlation = valid_data['Games Played'].corr(valid_data['Yards per Game'])
print(f"Correlation between Games Played and Yards per Game: {correlation}")
plt.scatter(valid_data['Games Played'], valid_data['Yards per Game'])
plt.title('Games Played vs Yards per Game')
plt.xlabel('Games Played')
plt.ylabel('Yards per Game')
plt.show()
Correlation between Games Played and Yards per Game: 0.13059217810090026
Hypothesis Testing
The two-sample t-test (t = 7.63, p-value = 2.92e-14) shows a statistically significant difference between quarterbacks who played more games and those who played fewer games. Players with more games on average tend to have significantly higher passing yards per game.
Null Hypothesis (H₀): There is no significant difference in the average passing yards per game between quarterbacks who played more games and those who played fewer games.
Alternative Hypothesis (H₁): There is a significant difference in the average passing yards per game between quarterbacks who played more games and those who played fewer games.
Based on the two-sample t-test (t = 7.63, p-value = 2.92e-14), we reject the null hypothesis and conclude that there is a significant difference in the average passing yards per game between quarterbacks who played more games and those who played fewer games.
t_stat, p_value = stats.ttest_ind(
    more_games_group['Yards per Game'],
    fewer_games_group['Yards per Game'],
    equal_var=False
)
print(f'T-statistic: {t_stat}')
print(f'P-value: {p_value}')
means = [fewer_games_avg_ypg, more_games_avg_ypg]
labels = ['Fewer Games Played', 'More Games Played']
plt.bar(labels, means, yerr=[stats.sem(fewer_games_group['Yards per Game']), stats.sem(more_games_group['Yards per Game'])])
plt.title('Average Yards per Game by Group')
plt.ylabel('Average Yards per Game')
plt.show()
T-statistic: 7.6318812441510415
P-value: 2.919898696659963e-14
Primary Analysis/Visualizations (LUKE)¶
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
from sklearn import tree
import matplotlib.pyplot as plt
features = ['Games Played', 'Passes Attempted', 'Passes Completed', 'Completion Percentage', 'TD Passes', 'Ints', 'Passer Rating']
target = 'Yards per Game'
valid_data = Career_Passing.dropna(subset=features + [target])
X = valid_data[features]
y = valid_data[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)
non_numeric_columns = X_train.select_dtypes(include=['object']).columns
print(f"Non-numeric columns: {non_numeric_columns.tolist()}")
for column in non_numeric_columns:
    X_train[column] = pd.to_numeric(X_train[column], errors='coerce')
    X_test[column] = pd.to_numeric(X_test[column], errors='coerce')
imputer = SimpleImputer(strategy='mean')
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns)
dt_regressor = DecisionTreeRegressor(random_state=42)
param_grid = {
    'max_depth': [3, 5],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
}
grid_search = GridSearchCV(
    estimator=dt_regressor,
    param_grid=param_grid,
    cv=5,
    scoring='r2',
    n_jobs=-1,
    error_score='raise'
)
try:
    grid_search.fit(X_train, y_train)
except Exception as e:
    print(f"An error occurred during model fitting: {e}")
    raise
best_dt = grid_search.best_estimator_
y_pred = best_dt.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r_squared = r2_score(y_test, y_pred)
print(f'Best Parameters: {grid_search.best_params_}')
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r_squared}')
importances = best_dt.feature_importances_
feature_importance_df = pd.DataFrame({'Feature': features, 'Importance': importances})
feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)
plt.figure(figsize=(10, 6))
plt.bar(feature_importance_df['Feature'], feature_importance_df['Importance'])
plt.title('Feature Importances from Decision Tree Regressor')
plt.xlabel('Features')
plt.ylabel('Importance')
plt.show()
plt.figure(figsize=(200, 100))
tree.plot_tree(best_dt, feature_names=features, filled=True)
plt.title('Decision Tree Structure')
plt.show()
Non-numeric columns: ['Passes Attempted', 'Passes Completed', 'Completion Percentage', 'TD Passes', 'Ints']
Best Parameters: {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 2}
Mean Squared Error: 238.84329167011956
R-squared: 0.954232577202681
Conclusions & Insights(LUKE) :¶
The Decision Tree Regressor provided valuable insights into the factors affecting a quarterback's average passing yards per game. The key takeaways are that efficiency matters most and that experience alone is not enough, which points toward a more data-driven strategy. Passer rating and completion percentage are strong predictors, while simply playing more games does not significantly increase per-game averages. Using these insights, teams can design training and strategy around the most influential factors, and coaches and players can make informed decisions to improve outcomes on the field.
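One quick way to sanity-check which splits a fitted tree actually relies on is `sklearn.tree.export_text`, which prints the learned rules as readable text. Here is a minimal sketch on synthetic stand-in data (the data and feature names are placeholders, not our fitted model, where the first feature is constructed to dominate the target):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in data: the first feature dominates the target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)

# Print the learned split rules as readable text
rules = export_text(model, feature_names=["Passer Rating", "Completion Percentage", "Games Played"])
print(rules)
```

Reading the printed rules alongside the importance bar chart makes it easy to verify that the top splits use the features the chart claims are most important.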
The second hypothesis is that receivers with taller heights average more yards per reception than shorter receivers.
Null Hypothesis: Receivers with taller heights do not average more yards per reception than shorter receivers.
Alternative Hypothesis: Receivers with taller heights do average more yards per reception than shorter receivers¶
Exploratory Data Analysis (Evan)¶
#Merging and cleaning the data
df = pd.merge(basic_stats, career_rec, on='Player Id')
df = pd.merge(df, rec_game_logs, on='Player Id')
df = df[['Player Id', 'Height (inches)', 'Receptions_x', 'Receiving Yards_x']]
df['Receiving Yards_x'] = pd.to_numeric(df['Receiving Yards_x'], errors='coerce')
df['Receptions_x'] = pd.to_numeric(df['Receptions_x'], errors='coerce')
df = df[df['Receptions_x'] >= 40]
df = df[df['Receiving Yards_x'] >= 500]
df = df.dropna()
df = df.groupby('Player Id').mean().reset_index()
df['Yards Per Reception'] = df['Receiving Yards_x'] / df['Receptions_x']
Hypothesis testing: Using a two-sample t-test, we obtained a p-value of 0.185, meaning we fail to reject the null hypothesis.
Therefore, we come to the conclusion that receivers with taller heights do not average more yards per reception than shorter receivers.¶
#T Test
taller = df[df['Height (inches)'] >= 73]
shorter = df[df['Height (inches)'] < 73]
t, p = stats.ttest_ind(taller['Yards Per Reception'], shorter['Yards Per Reception'])
x_labs = ['Players shorter than 73 inches', 'Players 73 inches and taller']
y_labs = [shorter['Yards Per Reception'].mean(), taller['Yards Per Reception'].mean()]
plt.bar(x_labs, y_labs)
plt.title("Average Yards per Reception by Height Groups")
plt.ylabel("Yards per Reception")
print('t-statistic:', t)
print('p-value:', p)
t-statistic: -1.3272983734848558 p-value: 0.1851522343819193
Correlation Analysis: There is a very weak negative relationship between height and average yards per reception. This can be seen through the scatterplot below.¶
#Correlation
corr = df['Height (inches)'].corr(df['Yards Per Reception'])
#Scatter Plot
plt.scatter(df['Height (inches)'], df['Yards Per Reception'])
plt.xlabel('Height (inches)')
plt.ylabel('Yards Per Reception')
plt.title('Height vs. Yards Per Reception')
print(corr)
-0.13695887457982833
#Summary
summary = df.describe()
plt.hist(df['Height (inches)'], bins=14, edgecolor='black')
plt.title('Distribution of Heights')
plt.xlabel('Height (inches)')
plt.ylabel('Frequency')
print(summary)
       Height (inches)  Receptions_x  Receiving Yards_x  Yards Per Reception
count       410.000000    410.000000         410.000000           410.000000
mean         73.336585     53.763172         725.470488            13.669289
std           2.508446      8.664600         110.465848             2.165105
min          67.000000     40.000000         504.000000             8.775862
25%          71.000000     47.000000         643.000000            12.123745
50%          73.000000     53.000000         735.666667            13.513132
75%          75.000000     59.000000         806.750000            15.084034
max          80.000000     84.000000         998.000000            21.744186
Machine Learning Primary Analysis (Evan)¶
ML Analysis: Regression¶
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
X = df['Height (inches)'].values.reshape(-1, 1)
y = df['Yards Per Reception'].values
# Split the data
def split_data(X, Y, test_size=0.2, random_state=42):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=random_state)
    return X_train, X_test, Y_train, Y_test
X_train, X_test, y_train, y_test = split_data(X, y)
# Define functions
def fit_model(X_train, Y_train):
    model = LinearRegression()
    model.fit(X_train, Y_train)
    return model

def predict_data(model, X_train, X_test):
    Y_train_pred = model.predict(X_train)
    Y_test_pred = model.predict(X_test)
    return Y_train_pred, Y_test_pred

def evaluate_model(Y_train, Y_train_pred, Y_test, Y_test_pred):
    mse_train = mean_squared_error(Y_train, Y_train_pred)
    mse_test = mean_squared_error(Y_test, Y_test_pred)
    r2_train = r2_score(Y_train, Y_train_pred)
    r2_test = r2_score(Y_test, Y_test_pred)
    return mse_train, mse_test, r2_train, r2_test

# Create polynomial features
def create_polynomial_features(X, degree):
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_poly = poly.fit_transform(X)
    return X_poly, poly
results = []
# Iterate through degrees 1 to 6
for degree in range(1, 7):
    X_poly_train, poly_transformer = create_polynomial_features(X_train, degree)
    X_poly_test = poly_transformer.transform(X_test)
    poly_model = fit_model(X_poly_train, y_train)
    Y_train_pred, Y_test_pred = predict_data(poly_model, X_poly_train, X_poly_test)
    mse_train, mse_test, r2_train, r2_test = evaluate_model(y_train, Y_train_pred, y_test, Y_test_pred)
    results.append((f'Polynomial Regression (degree {degree})', mse_train, mse_test, r2_train, r2_test))
results_df = pd.DataFrame(results, columns=['Model', 'MSE Train', 'MSE Test', 'R2 Train', 'R2 Test'])
print(results_df)
                              Model  MSE Train  MSE Test  R2 Train   R2 Test
0  Polynomial Regression (degree 1)   4.290609  5.783732  0.019195  0.013312
1  Polynomial Regression (degree 2)   4.136383  5.575286  0.054450  0.048872
2  Polynomial Regression (degree 3)   4.102971  5.693996  0.062088  0.028621
3  Polynomial Regression (degree 4)   4.056098  5.453387  0.072803  0.069668
4  Polynomial Regression (degree 5)   4.043953  5.559581  0.075579  0.051552
5  Polynomial Regression (degree 6)   4.044451  5.552693  0.075465  0.052727
Visualization¶
# Visualization of the curve for all degrees
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, color='blue', label='Train')
plt.scatter(X_test, y_test, color='orange', label='Test')
# Plot regression curves for each degree
for degree in range(1, 7):
    X_poly_train, poly_transformer = create_polynomial_features(X_train, degree)
    X_sorted = np.sort(X_train, axis=0)
    Y_poly_sorted_pred = fit_model(X_poly_train, y_train).predict(poly_transformer.transform(X_sorted))
    plt.plot(X_sorted, Y_poly_sorted_pred, label=f'Degree {degree}')
plt.xlabel('Height (inches)')
plt.ylabel('Yards Per Reception')
plt.title('Polynomial Regression: Height vs. Yards Per Reception')
plt.legend()
plt.grid(True)
plt.show()
Conclusion and insights: By fitting polynomial regressions of various degrees, I was able to determine the best polynomial fit for the data. Based on the models I evaluated for polynomial degrees 1 through 6, a degree-4 polynomial fits our data best, giving the lowest test MSE and the highest test R² due to the nonlinear shape of the data. However, the error and R² values are still not within a range that would support concluding that height and yards per reception are correlated. This aligns with my correlation and p-value analysis, which indicated little to no relationship between the two variables.¶
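The degree selection described above can also be done programmatically from the results table. A minimal sketch, using an illustrative copy of `results_df` with values rounded from the output above:

```python
import pandas as pd

# Illustrative copy of the results table (test-set values rounded from the output above)
results_df = pd.DataFrame({
    'Model': [f'Polynomial Regression (degree {d})' for d in range(1, 7)],
    'MSE Test': [5.7837, 5.5753, 5.6940, 5.4534, 5.5596, 5.5527],
    'R2 Test': [0.0133, 0.0489, 0.0286, 0.0697, 0.0516, 0.0527],
})

# Select the model with the lowest held-out MSE (here also the highest test R^2)
best_row = results_df.loc[results_df['MSE Test'].idxmin()]
print(best_row['Model'])  # Polynomial Regression (degree 4)
```

Selecting on held-out (test) error rather than training error is what guards against simply picking the highest degree, which always fits the training data best.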
Exploratory Data Analysis (Owen)¶
My topic relates to the performance of NFL kickers at different altitudes. Specifically, I wanted to answer the question "Do kickers who were born at higher altitudes perform better at higher altitudes?" I feel this is an important question to answer because of the weight placed on kickers in NFL games. Often the winner of a game is decided by a last-second kick, meaning that a good kicking performance directly impacts the outcome of the game. In addition, whenever a game is played at high altitude there is always talk about what effects it could have on the visiting team. Therefore, we can assume that having a kicker a team knows will perform well at high altitudes would increase their chance of winning high-altitude games.
The third hypothesis is that kickers from higher altitudes are more accurate at higher altitudes than kickers from lower altitudes.
#find performance of kickers at each team location
#get all needed info in one df
merged_df = pd.merge(basic_stats_df[['Birth Place', 'Player Id']], birth_elevation_df, left_on=basic_stats_df['Birth Place'], right_on=birth_elevation_df['Location'], how='inner')
merged_df = merged_df.merge(kicker_game_df[['Year', 'Player Id', 'Home or Away', 'Opponent', 'Longest FG Made', 'FGs Attempted', 'FGs Made', 'FG Percentage', 'Extra Points Attempted', 'Extra Points Made', 'Percentage of Extra Points Made']], on='Player Id', how='inner')
merged_df = merged_df.merge(kicker_career_df[['Player Id', 'Year', 'Team', 'Career FG Percentage']], on=['Player Id', 'Year'], how='inner')
#turn team names into abbreviations
merged_df['Team'] = merged_df['Team'].map(team_mapping).fillna(merged_df['Team'])
#get the location of the game
def get_location(row):
    if row['Home or Away'] == 'Home':
        return row['Team']
    else:
        return row['Opponent']
merged_df['Game Location'] = merged_df.apply(get_location, axis=1)
#remove stats from pro-bowl(APR, NPR, RIC, CRT), played at different places each year
merged_df = merged_df[~merged_df['Opponent'].isin(['APR', 'NPR', 'RIC', 'CRT'])]
#make new col for game elevation
merged_df = merged_df.merge(team_elevation_df, on='Game Location', how='inner')
#convert FG percentage to numeric
merged_df['FG Percentage'] = pd.to_numeric(merged_df['FG Percentage'], errors='coerce')
merged_df.dropna(inplace=True)
merged_df.sort_values('Game Elevation')
| | key_0 | Birth Place | Player Id | Unnamed: 0 | Location | Elevation | Year | Home or Away | Opponent | Longest FG Made | FGs Attempted | FGs Made | FG Percentage | Extra Points Attempted | Extra Points Made | Percentage of Extra Points Made | Team | Career FG Percentage | Game Location | Game Elevation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1625 | Baton Rouge , LA | Baton Rouge , LA | stephengostkowski/2506922 | 8 | Baton Rouge , LA | 28.0 | 2009 | Away | NO | 36 | 2 | 1 | 50.0 | 2 | 2 | 100.0 | NE | 83.9 | NO | 1.0 |
| 2917 | Yankton , SD | Yankton , SD | adamvinatieri/2503471 | 20 | Yankton , SD | 368.0 | 1998 | Away | NO | 49 | 3 | 3 | 100.0 | 3 | 3 | 100.0 | NE | 79.5 | NO | 1.0 |
| 1425 | Omaha , NE | Omaha , NE | dancarpenter/2507401 | 6 | Omaha , NE | 331.0 | 2009 | Away | NO | 41 | 1 | 1 | 100.0 | 1 | 1 | 100.0 | MIA | 89.3 | NO | 1.0 |
| 1445 | Omaha , NE | Omaha , NE | dancarpenter/2507401 | 6 | Omaha , NE | 331.0 | 2008 | Away | NO | 0 | 1 | 0 | 0.0 | 2 | 2 | 100.0 | MIA | 84.0 | NO | 1.0 |
| 1487 | Baton Rouge , LA | Baton Rouge , LA | stephengostkowski/2506922 | 8 | Baton Rouge , LA | 28.0 | 2015 | Away | NO | 36 | 3 | 2 | 66.7 | 2 | 2 | 100.0 | NE | 91.7 | NO | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1162 | Mayfield Heights , OH | Mayfield Heights , OH | mattprater/2506677 | 5 | Mayfield Heights , OH | 329.0 | 2013 | Home | WAS | 19 | 1 | 1 | 100.0 | 6 | 6 | 100.0 | DEN | 96.2 | DEN | 1684.0 |
| 3709 | Orange , TX | Orange , TX | mattbryant/2504797 | 27 | Orange , TX | 8.0 | 2008 | Away | DEN | 33 | 2 | 2 | 100.0 | 1 | 1 | 100.0 | TB | 84.2 | DEN | 1684.0 |
| 1158 | Mayfield Heights , OH | Mayfield Heights , OH | mattprater/2506677 | 5 | Mayfield Heights , OH | 329.0 | 2013 | Home | PHI | 53 | 1 | 1 | 100.0 | 7 | 7 | 100.0 | DEN | 96.2 | DEN | 1684.0 |
| 1177 | Mayfield Heights , OH | Mayfield Heights , OH | mattprater/2506677 | 5 | Mayfield Heights , OH | 329.0 | 2012 | Home | SF | 53 | 1 | 1 | 100.0 | 3 | 3 | 100.0 | DEN | 81.3 | DEN | 1684.0 |
| 1232 | Mayfield Heights , OH | Mayfield Heights , OH | mattprater/2506677 | 5 | Mayfield Heights , OH | 329.0 | 2010 | Home | STL | 49 | 2 | 2 | 100.0 | 3 | 3 | 100.0 | DEN | 88.9 | DEN | 1684.0 |
3221 rows × 20 columns
Read more about heat maps: https://www.atlassian.com/data/charts/heatmap-complete-guide
#create heat map that shows how kickers born at different elevations perform at different elevations
#prepare the data for the heatmap
heatmap_data = merged_df.pivot_table(values='FG Percentage',
                                     index='Game Elevation',
                                     columns='Elevation',
                                     aggfunc='mean')
#create a custom color map from green to red
#create a custom color map from red (low values) to green (high values)
cmap = sns.diverging_palette(0, 120, as_cmap=True)
#create heatmap
plt.figure(figsize=(14, 10)) #increase figure size
sns.heatmap(heatmap_data,
            cmap=cmap,              #use custom color map
            annot=True,             #enable annotations
            fmt='.1f',              #format for annotations
            annot_kws={"size": 8},  #adjust annotation size
            linewidths=0.5,         #width of lines that will divide each cell
            cbar_kws={'label': 'Average Kicker Accuracy (%)',
                      'ticks': np.arange(0, 101, 10)})  #color bar label and ticks
#label axes and title
plt.xlabel('Kicker Birthplace Elevation (m)', fontsize=16)
plt.ylabel('Game Elevation (m)', fontsize=16)
plt.title('Heatmap of Kicker Accuracy by Elevations', fontsize=18, fontweight='bold')
#customize ticks
plt.xticks(fontsize=12, rotation=45) # Rotate x-ticks for better visibility
plt.yticks(fontsize=12)
#set major ticks to reduce clutter
plt.locator_params(axis='x', nbins=10) # Reduce the number of x-ticks
plt.locator_params(axis='y', nbins=10) # Reduce the number of y-ticks
#show plot
plt.tight_layout() #adjust layout for better fit
plt.show()
#scatter plot that shows birth elevation vs kicker accuracy at games at or above the median game elevation
#filter the DataFrame for games at or above the median elevation
median_elevation = merged_df['Game Elevation'].median()
filtered_df = merged_df[merged_df['Game Elevation'] >= median_elevation]
#data for the scatter plot
birthplace_elevation = filtered_df['Elevation']
kicker_accuracy = filtered_df['FG Percentage']
#create scatter plot
plt.scatter(birthplace_elevation, kicker_accuracy,
            color='blue',
            alpha=0.7,       #set transparency
            edgecolors='w')  #add white edges to points
#label axes and title
plt.xlabel('Kicker Birthplace Elevation (m)', fontsize=14)
plt.ylabel('Kicker Accuracy (%)', fontsize=14)
plt.title('Kicker Accuracy vs Birth Elevation Above Median Elevation', fontsize=16, fontweight='bold')
#customize ticks
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
#add grid for better visibility
plt.grid(color='grey', linestyle='--', linewidth=0.5)
#show plot
plt.show()
Hypothesis Testing:
Group 1: Kickers born below the median elevation of games played.
Group 2: Kickers born above the median elevation of games played.
H0: There is no significant difference in kicker accuracy at elevation above the median elevation of games played for the two groups.
Ha: There is a significant difference in kicker accuracy at elevation above the median elevation of games played for the two groups.
#do hypothesis testing
from scipy import stats
#calculate the median game elevation
median_game_elevation = merged_df['Game Elevation'].median()
#split the data into two groups based on if they were born above or below median game elevation
group_below_median = merged_df[merged_df['Elevation'] < median_game_elevation]
group_above_median = merged_df[merged_df['Elevation'] >= median_game_elevation]
#get accuracy for both groups
accuracy_below = group_below_median['FG Percentage']
accuracy_above = group_above_median['FG Percentage']
#perform t-test, want to compare means of two groups and have sample size > 30
t_stat, p_value = stats.ttest_ind(accuracy_below, accuracy_above)
test_type = "t-test"
#print results
print(f"P-Value: {p_value}")
P-Value: 0.9912679241998194
Using a significance level of 0.05, we fail to reject the null hypothesis that there is no significant difference in accuracy between the two groups at elevations above the median game elevation. This is because the p-value of 0.99 is not less than 0.05.
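As a robustness check (the quarterback analysis above used Welch's version), the same comparison can be run without assuming equal group variances by passing `equal_var=False` to `scipy.stats.ttest_ind`. A minimal sketch on hypothetical synthetic samples, not the real kicker data:

```python
import numpy as np
from scipy import stats

# Hypothetical accuracy samples for the two birth-elevation groups
rng = np.random.default_rng(1)
accuracy_below = rng.normal(loc=83.3, scale=29.0, size=1500)
accuracy_above = rng.normal(loc=83.4, scale=28.5, size=1500)

# equal_var=False runs Welch's t-test, which does not assume equal group variances
t_stat, p_value = stats.ttest_ind(accuracy_below, accuracy_above, equal_var=False)
print(f"P-Value: {p_value}")
```

With group standard deviations this similar, the Welch and pooled-variance results should agree closely, but Welch's test is the safer default when group sizes or spreads differ.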
Correlation analysis:
There is a very weak negative correlation, which suggests that birth elevation has little to no relationship with kicking performance.
#create scatter plot to show overall performance of kickers from different elevations
#data for the scatter plot
birth_elevation = merged_df['Elevation']
career_fg_percentage = merged_df['Career FG Percentage']
#calculate correlation between birth elevation and field goal percentage
correlation = merged_df['Elevation'].corr(merged_df['Career FG Percentage'])
print(f"Correlation between Birth Elevation and Career FG Percentage: {correlation}")
#create scatter plot
plt.scatter(birth_elevation, career_fg_percentage, alpha=0.6)
#label axes and title
plt.xlabel('Kicker Birth Elevation (m)')
plt.ylabel('Career FG Percentage (%)')
plt.title('Scatter Plot of Career FG Percentage vs Birth Elevation')
#show plot
plt.grid()
plt.show()
Correlation between Birth Elevation and Career FG Percentage: -0.07681920437272324
Read more about correlation: https://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one/11-correlation-and-regression
Dataset descriptive analysis:
#print summary statistics for dataset
print(merged_df.describe())
#generate histogram for game elevations
plt.hist(merged_df['Elevation'])
plt.ylabel('Number of Kickers')
plt.xlabel('Elevation (m)')
plt.title('Count of Kickers From Varying Elevations')
plt.show()
Unnamed: 0 Elevation Year FG Percentage \
count 3221.000000 3221.000000 3221.000000 3221.000000
mean 12.780503 234.675877 2011.137845 83.349922
std 9.312714 254.228598 4.504546 28.901746
min 0.000000 7.000000 1996.000000 0.000000
25% 4.000000 23.000000 2009.000000 66.700000
50% 11.000000 174.000000 2012.000000 100.000000
75% 20.000000 368.000000 2015.000000 100.000000
max 30.000000 979.000000 2016.000000 100.000000
Career FG Percentage Game Elevation
count 3221.000000 3221.000000
mean 83.428469 213.067991
std 7.714799 356.085307
min 0.000000 1.000000
25% 79.400000 51.000000
50% 84.200000 135.000000
75% 88.900000 225.000000
max 100.000000 1684.000000
From this graph we can tell that while we have a good range of kickers from varying elevations, the data is heavily skewed toward lower elevations. This can be explained by there being fewer places of high elevation.
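Because the counts are so skewed, a log-scaled count axis makes the thin high-elevation tail easier to see (a sketch using synthetic exponential elevations in place of `merged_df['Elevation']`):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs as a script
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# synthetic elevations with a skew similar to our data (mean ~235 m)
elevations = rng.exponential(scale=235, size=3221)

plt.hist(elevations, bins=30)
plt.yscale('log')  # log counts make the sparse high-elevation tail visible
plt.xlabel('Elevation (m)')
plt.ylabel('Number of Kickers (log scale)')
plt.title('Count of Kickers From Varying Elevations (log y-axis)')
plt.savefig('elevation_hist.png')
```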
Machine Learning Primary Analysis (Owen)¶
I decided to use a clustering model to help determine whether kickers born at higher altitudes perform better there. I chose clustering because, if there were a positive relationship between birth altitude and performance at high altitude, I would expect to see a clear cluster of players with both high performance and high birth elevations. For the specific model I chose K-means clustering. To find a good k value I used the elbow method and also checked the silhouette scores at different k values; after both of these checks I settled on k = 3. The first step is to find k with the elbow method.
#standardize data
features = merged_df[['Elevation', 'Game Elevation', 'FG Percentage',
'Longest FG Made', 'Extra Points Made', 'Career FG Percentage']]
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)
inertia = []
# Test different numbers of clusters
for k in range(1, 11):
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(features_scaled)
inertia.append(kmeans.inertia_)
# Plot the elbow curve
plt.plot(range(1, 11), inertia, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
Next I will use silhouette scores to get more info on the best K value.
# List to store silhouette scores
silhouette_scores = []
# Test different values of k (number of clusters)
k_values = range(2, 11) # Start from 2 because silhouette score requires at least 2 clusters
for k in k_values:
kmeans = KMeans(n_clusters=k, random_state=42)
clusters = kmeans.fit_predict(features_scaled) # features_scaled is your standardized data
score = silhouette_score(features_scaled, clusters)
silhouette_scores.append(score)
# Plot the silhouette scores
plt.plot(k_values, silhouette_scores, marker='o', linestyle='-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score vs Number of Clusters')
plt.show()
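Reading the silhouette plot by eye can also be reduced to a single suggested k by taking the argmax of the scores (a minimal sketch on synthetic blob data rather than `features_scaled`; on our real features the scores at nearby k values were close, which is why we also consulted the elbow plot):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic data with 3 well-separated groups standing in for features_scaled
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

k_values = range(2, 11)
scores = []
for k in k_values:
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores.append(silhouette_score(X, labels))

# the k with the highest silhouette score is the programmatic suggestion
best_k = list(k_values)[int(np.argmax(scores))]
print(f"Best k by silhouette score: {best_k}")
```

On clean, well-separated data like these blobs the argmax recovers the true cluster count; on messier real data it is a starting point to weigh against the elbow curve, not a final answer.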
Visualization (Owen)¶
For my analysis I decided to use a 3d scatter plot where each point represents a game played by a kicker. I chose a scatter plot so I could plot each game and visualize the clusters that were created by the K means clustering model. I chose to do a 3d plot because I needed to see how both birth elevation and game elevation affect the performance of a kicker. To do this I put game elevation on the x axis, birth elevation on the y axis, and kicker FG percentage on the z axis.
From these I will use k = 3. Now to do the actual clustering.
# Fit the model with the chosen number of clusters
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(features_scaled)
# Add the cluster labels to the original dataframe
merged_df['Cluster'] = clusters
# Define distinct colors for each cluster (adjust as needed)
cluster_colors = ['red', 'green', 'blue']
# Add a 3D plot
fig = plt.figure(figsize=(20, 16))
ax = fig.add_subplot(111, projection='3d')
# Plot the data with clusters using individual colors
for cluster in range(3): # Assuming 3 clusters
cluster_data = merged_df[merged_df['Cluster'] == cluster]
ax.scatter(cluster_data['Elevation'],
cluster_data['Game Elevation'],
cluster_data['FG Percentage'],
color=cluster_colors[cluster], label=f'Cluster {cluster}')
# Labels and title
ax.set_xlabel('Elevation')
ax.set_ylabel('Game Elevation')
ax.set_zlabel('FG Percentage')
ax.set_title('3D Cluster Plot')
# Show the legend to identify cluster colors
ax.legend()
plt.show()
# Calculate silhouette score using the features from the model (in your case, the scaled features)
score = silhouette_score(features_scaled, clusters)
print(f'Silhouette Score: {score}')
Silhouette Score: 0.3922277414278905
Read more about K means clustering: https://www.ibm.com/topics/k-means-clustering
From the visualization we can see the three clusters: one at low game elevation, low FG percentage, and varying birth elevation; another at low game elevation, mid-to-high FG percentage, and varying birth elevation; and a third at high game elevation with varying birth elevation and FG percentage. In this third cluster we see slightly fewer data points at high birth elevation and low FG percentage, but nothing significant. In addition, there are fewer data points with high birth elevations overall, which could make the results less reliable. Given the variability in the third group's birth elevation, we can say that being born at a higher elevation does not give an advantage when kicking field goals at a higher elevation.
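One way to back up this visual reading numerically is to average each feature within each cluster; on the real data this would be a `groupby` on `merged_df['Cluster']` (shown here as a sketch with synthetic labeled data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# synthetic stand-in for merged_df with cluster labels already assigned
df = pd.DataFrame({
    'Elevation': rng.exponential(235, size=300),
    'Game Elevation': rng.exponential(213, size=300),
    'FG Percentage': rng.normal(83, 15, size=300).clip(0, 100),
    'Cluster': rng.integers(0, 3, size=300),
})

# mean feature values per cluster summarize what each cluster represents
profile = df.groupby('Cluster')[['Elevation', 'Game Elevation', 'FG Percentage']].mean()
print(profile.round(1))
```

If birth elevation conferred an advantage at altitude, the high-game-elevation cluster's profile would show both elevated mean birth elevation and elevated FG percentage; a flat birth-elevation column across clusters supports our conclusion.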
Insights and Conclusion (Owen)¶
From the data and machine learning analysis we can conclude that being born at a higher elevation does not give an advantage when kicking at higher altitudes. For NFL teams this means that instead of looking for a good kicker who was born at higher altitude, they should focus on getting the best overall kicker. Getting the best overall kicker will serve a team better than trying to find a kicker specialized for high altitude.
Overall Conclusion¶
Overall, throughout this project we gained valuable skills in how to find, process, and gain insights from data. We found that experience is not as strong an indicator of QB efficiency as we would have thought, that there was no correlation between height and yards per reception, and that being born at a high altitude does not make you a better kicker at high altitudes. Overall this was a very educational experience that pushed us to grow and learn new things.